```mermaid
graph LR
A["Existing Chart Benchmarks<br/>(DVQA, FigureQA, ChartQA)<br/>Template-based, oversimplified"] --> B["Over-optimistic<br/>progress measures"]
B --> C["CharXiv<br/>2,323 real-world charts<br/>Expert-curated questions"]
C --> D["Realistic signal<br/>for chart understanding"]
style A fill:#e74c3c,stroke:#333,color:#fff
style B fill:#f39c12,stroke:#333,color:#fff
style C fill:#27ae60,stroke:#333,color:#fff
style D fill:#3498db,stroke:#333,color:#fff
```
CharXiv Reasoning
A comprehensive benchmark for chart understanding in multimodal LLMs: 2,323 real-world charts from scientific papers with expert-curated questions
Keywords: CharXiv, chart understanding, multimodal LLM, visual reasoning, scientific charts, MLLM evaluation, descriptive questions, reasoning questions, Princeton NLP, NeurIPS 2024

Introduction
Charts are everywhere — in scientific papers, financial reports, dashboards, and presentations. Understanding charts requires more than reading text; it demands visual perception, data extraction, and multi-step reasoning across complex visual elements.
Most existing chart benchmarks use oversimplified, template-generated charts with formulaic questions, leading to over-optimistic estimates of AI progress. Open-source models can appear to outperform strong proprietary models on these benchmarks, yet a simple stress test with slightly different charts or questions can degrade their performance by up to 34.5%.
CharXiv addresses this by providing a comprehensive evaluation suite of 2,323 natural, challenging, and diverse charts sourced directly from arXiv scientific papers. All charts and questions are handpicked, curated, and verified by human experts. The result is a far more realistic and faithful measure of chart understanding capabilities.
“All models lag far behind human performance of 80.5%, underscoring weaknesses in the chart understanding capabilities of existing MLLMs.” — CharXiv Paper
What Is CharXiv?
CharXiv (Chart + arXiv) is a comprehensive evaluation benchmark for chart understanding in Multimodal Large Language Models (MLLMs). It consists of 2,323 high-resolution charts manually sourced from arXiv preprints, each paired with expert-curated questions that test both basic comprehension and complex reasoning.
Two Types of Questions
CharXiv tests two fundamentally different capabilities:
- Descriptive Questions — Examine basic chart elements (axis labels, legends, data values, chart type identification). Each chart has 4 descriptive questions (3 answerable + 1 unanswerable designed to test whether models can recognize when information is not available).
- Reasoning Questions — Require synthesizing information across complex visual elements, performing multi-step reasoning, comparing trends, and drawing conclusions. Each chart has 1 reasoning question.
This gives a total of 11,615 questions across the full dataset (5 questions × 2,323 charts).
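The arithmetic behind that total is a quick sanity check:

```python
# Question counts as stated by the benchmark:
# 2,323 charts, each with 4 descriptive + 1 reasoning question.
NUM_CHARTS = 2323
DESCRIPTIVE_PER_CHART = 4  # 3 answerable + 1 unanswerable
REASONING_PER_CHART = 1

total = NUM_CHARTS * (DESCRIPTIVE_PER_CHART + REASONING_PER_CHART)
print(total)  # 11615 = 5,000 validation + 6,615 test questions
```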
Key Characteristics
| Feature | Details |
|---|---|
| Total charts | 2,323 (sourced from arXiv preprints) |
| Validation set | 1,000 charts / 5,000 questions (used for leaderboard) |
| Test set | 1,323 charts / 6,615 questions |
| Question types | Descriptive (4 per chart) + Reasoning (1 per chart) |
| Answer format | Open-vocabulary short answers (easily verifiable) |
| Chart diversity | Line, bar, scatter, heatmap, box plot, radar, and more |
| Source | Real scientific charts from arXiv papers |
| Curation | All handpicked and verified by human experts |
| Evaluation | Zero-shot, natural instructions, automated scoring |
| Venue | NeurIPS 2024 |
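Because answers are open-vocabulary short strings, scoring reduces to comparing a model's prediction against the gold answer. The official automated scorer lives in the GitHub repository; the sketch below is a deliberately simplified exact-match stand-in, and the `normalize` helper is illustrative, not CharXiv's actual routine:

```python
def normalize(ans: str) -> str:
    """Simplified normalization for comparing short answers.
    Illustrative only; CharXiv's official scorer is in its GitHub repo."""
    return ans.strip().lower().rstrip(".")

def exact_match(pred: str, gold: str) -> int:
    """Return 1 if the normalized prediction equals the gold answer."""
    return int(normalize(pred) == normalize(gold))

print(exact_match(" 0.85 ", "0.85"))        # 1
print(exact_match("Line chart", "bar chart"))  # 0
```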
```mermaid
graph TD
CX["CharXiv<br/>2,323 charts from arXiv"] --> D["Descriptive Questions<br/>(4 per chart)"]
CX --> R["Reasoning Questions<br/>(1 per chart)"]
D --> D1["Answerable (3)<br/>Axis labels, legends,<br/>data values"]
D --> D2["Unanswerable (1)<br/>Tests refusal ability"]
R --> R1["Multi-step reasoning<br/>Trend comparison,<br/>data synthesis"]
style CX fill:#e74c3c,color:#fff,stroke:#333
style D fill:#3498db,color:#fff,stroke:#333
style R fill:#27ae60,color:#fff,stroke:#333
style D1 fill:#6cc3d5,color:#fff,stroke:#333
style D2 fill:#8e44ad,color:#fff,stroke:#333
style R1 fill:#56cc9d,color:#fff,stroke:#333
```
Who Built It?
CharXiv was developed by researchers at Princeton University’s Natural Language Processing Group (Princeton NLP), with contributions from the University of Wisconsin-Madison.
Publication
CharXiv was published at NeurIPS 2024, one of the premier machine learning conferences. The paper spans 121 pages with 90 figures, providing an exceptionally thorough analysis of chart understanding gaps across dozens of models.
| Resource | Link |
|---|---|
| arXiv paper | arxiv.org/abs/2406.18521 |
| Project page | charxiv.github.io |
| GitHub | github.com/princeton-nlp/CharXiv |
| License | CC BY-SA 4.0 (questions); chart copyrights belong to original arXiv authors |
What Skills Does It Test?
CharXiv tests the complete pipeline of chart understanding — from basic visual perception to complex multi-step reasoning. Unlike benchmarks that test specialized knowledge, CharXiv reveals whether AI models can actually read and reason about charts the way humans do.
```mermaid
graph TD
CX["CharXiv<br/>Chart Understanding"] --> VP["Visual Perception<br/>Chart type, layout,<br/>color mapping"]
CX --> DE["Data Extraction<br/>Reading values,<br/>axis labels, legends"]
CX --> TR["Trend Recognition<br/>Patterns, comparisons,<br/>outliers"]
CX --> MR["Multi-step Reasoning<br/>Synthesizing across<br/>visual elements"]
CX --> RA["Refusal Ability<br/>Recognizing when<br/>info is unavailable"]
style CX fill:#e74c3c,color:#fff,stroke:#333
style VP fill:#3498db,color:#fff,stroke:#333
style DE fill:#27ae60,color:#fff,stroke:#333
style TR fill:#f39c12,color:#fff,stroke:#333
style MR fill:#8e44ad,color:#fff,stroke:#333
style RA fill:#e67e22,color:#fff,stroke:#333
```
| Capability | What CharXiv Tests | Question Type |
|---|---|---|
| Visual perception | Identifying chart types, layout elements, color codes | Descriptive |
| Data extraction | Reading specific values from axes, legends, and data points | Descriptive |
| Refusal ability | Recognizing when requested information is not in the chart | Descriptive (unanswerable) |
| Trend analysis | Comparing trends across multiple series or time periods | Reasoning |
| Multi-step reasoning | Combining multiple chart elements to draw conclusions | Reasoning |
| Cross-element synthesis | Integrating information from different parts of a complex chart | Reasoning |
Why Existing Benchmarks Fall Short
Existing chart benchmarks like DVQA, FigureQA, and ChartQA (including their subsets in MathVista) use template-generated charts with predictable structures. Models can exploit these patterns without truly understanding charts. CharXiv exposes this by using:
- Real scientific charts with diverse and complex layouts
- Expert-curated questions that cannot be answered by pattern matching
- Unanswerable questions that test whether models know their limits
Current Leaderboard
The leaderboard below shows model performance on the CharXiv validation set (1,000 charts, 5,000 questions), evaluated in a zero-shot setting with natural instructions. We highlight the Reasoning accuracy (the harder and more discriminating metric) alongside Descriptive accuracy.
Source: CharXiv Leaderboard (consulted March 28, 2026). All models evaluated zero-shot on the validation set.
Top Performers
| Rank | Model | Type | Size | Reasoning (%) | Descriptive (%) |
|---|---|---|---|---|---|
| — | Human | — | — | 80.50 | 92.10 |
| 1 | o3 (high) | Proprietary | — | 78.60 | 95.00 |
| 2 | o4 mini (high) | Proprietary | — | 72.00 | 94.30 |
| 3 | Claude 3.7 Sonnet | Proprietary | — | 64.20 | — |
| 4 | Claude 3.5 Sonnet | Proprietary | — | 60.20 | 84.30 |
| 5 | GPT 4.1 mini | Proprietary | — | 56.80 | 88.40 |
| 6 | GPT 4.1 | Proprietary | — | 56.70 | 87.90 |
| 7 | GPT 4.5 | Proprietary | — | 55.40 | 90.00 |
| 8 | o1 (high) | Proprietary | — | 55.10 | 88.90 |
| 9 | Doubao 1.5 Pro | Proprietary | — | 54.40 | 84.30 |
| 10 | o1 | Proprietary | — | 52.60 | 87.45 |
Top Open-Source Models
| Rank | Model | Size | Reasoning (%) | Descriptive (%) |
|---|---|---|---|---|
| 1 | Qwen2.5-VL 72B | 72B | 49.70 | 87.40 |
| 2 | InternVL3 38B | 38B | 46.40 | 87.20 |
| 3 | InternVL3 78B | 78B | 46.00 | 85.10 |
| 4 | InternVL3 14B | 14B | 43.10 | 82.20 |
| 5 | Qwen2-VL 72B | 72B | 43.00 | 81.35 |
| 6 | Qwen2.5-VL 7B | 7B | 42.50 | 73.90 |
| 7 | Pixtral 12B | 12B | 42.40 | 68.12 |
| 8 | InternVL V2.5 38B | 38B | 42.40 | 79.60 |
| 9 | InternVL V2.5 78B | 78B | 42.40 | 82.30 |
| 10 | GPT 4.1 nano | — | 40.50 | 73.90 |
Key Observations
```mermaid
graph LR
A["Descriptive Tasks<br/>Top models: 87–95%<br/>Close to human (92%)"] --> C["Chart basics<br/>are becoming<br/>tractable"]
B["Reasoning Tasks<br/>Top model: 78.6%<br/>Human: 80.5%"] --> D["Reasoning gap<br/>is closing but<br/>still significant"]
E["Open-source<br/>Best: 49.7%<br/>reasoning"] --> F["Large gap vs<br/>proprietary models<br/>(78.6%)"]
style A fill:#27ae60,color:#fff,stroke:#333
style B fill:#f39c12,color:#fff,stroke:#333
style C fill:#3498db,color:#fff,stroke:#333
style D fill:#e74c3c,color:#fff,stroke:#333
style E fill:#8e44ad,color:#fff,stroke:#333
style F fill:#e74c3c,color:#fff,stroke:#333
```
- Descriptive accuracy is becoming tractable — Top proprietary models score 87–95%, approaching human performance (92%)
- Reasoning remains the bottleneck — The best model (o3 high) scores 78.6%, close to human level (80.5%), but most models score well below 60%
- Large proprietary-to-open gap on reasoning — The best open-source model (Qwen2.5-VL 72B at 49.7%) lags significantly behind o3 (78.6%)
- Domain-specific models underperform — Specialized chart models (ChartLlama, ChartGemma, etc.) score below 15%, far worse than general-purpose MLLMs
- Model scale matters for open-source — 72B+ models consistently outperform smaller variants on reasoning
Where to Explore the Benchmark
Dashboards and Leaderboards
| Resource | Description | Link |
|---|---|---|
| CharXiv Leaderboard | Official leaderboard with reasoning and descriptive breakdowns | charxiv.github.io/#leaderboard |
| CharXiv Project Page | Full introduction, examples, and music video overview | charxiv.github.io |
Dataset and Code
| Resource | Description | Link |
|---|---|---|
| Hugging Face Dataset | Full 2,323-chart dataset with questions and annotations | huggingface.co/datasets/princeton-nlp/CharXiv |
| GitHub Repository | Evaluation code, model configs, and documentation | github.com/princeton-nlp/CharXiv |
| arXiv Paper | Full 121-page technical paper with analysis | arxiv.org/abs/2406.18521 |
| CSV Results | Downloadable validation results for all models | charxiv.github.io/data/val_result.csv |
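The downloaded results CSV can be inspected with the standard library alone. The column names used below ("Model", "Reasoning", "Descriptive") are assumptions for illustration; check the actual header of `val_result.csv` before relying on them:

```python
import csv
import io

# Inline stand-in for the downloaded val_result.csv; the column
# names here are assumptions, not the file's guaranteed header.
sample = io.StringIO(
    "Model,Reasoning,Descriptive\n"
    "o3 (high),78.6,95.0\n"
    "Qwen2.5-VL 72B,49.7,87.4\n"
)

rows = list(csv.DictReader(sample))
best = max(rows, key=lambda r: float(r["Reasoning"]))
print(best["Model"])  # o3 (high)
```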
Load the Dataset
```python
from datasets import load_dataset

# Load the full CharXiv dataset from Hugging Face
dataset = load_dataset("princeton-nlp/CharXiv")

# Access the validation set
val = dataset["validation"]
print(f"Validation charts: {len(val)}")
```
Reasoning Question Breakdown
CharXiv reports reasoning accuracy broken down by sub-categories, revealing where models struggle most:
| Sub-category | What It Tests |
|---|---|
| Information Retrieval | Extracting specific values from complex charts |
| Comparison | Comparing data points, trends, or categories |
| Pattern Recognition | Identifying visual patterns across data series |
| Counting | Enumerating elements in dense or complex charts |
| Inference | Drawing conclusions not explicitly shown in the chart |
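Given per-question results tagged with these sub-categories, a per-category accuracy breakdown is a small aggregation. The record fields below ("category", "correct") are hypothetical, chosen for illustration rather than taken from the dataset's actual schema:

```python
from collections import defaultdict

# Hypothetical per-question results; field names are illustrative.
results = [
    {"category": "Comparison", "correct": True},
    {"category": "Comparison", "correct": False},
    {"category": "Counting", "correct": True},
]

totals = defaultdict(int)  # questions per sub-category
hits = defaultdict(int)    # correct answers per sub-category
for r in results:
    totals[r["category"]] += 1
    hits[r["category"]] += int(r["correct"])

accuracy = {cat: hits[cat] / totals[cat] for cat in totals}
print(accuracy)  # {'Comparison': 0.5, 'Counting': 1.0}
```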
Why CharXiv Matters
```mermaid
graph LR
A["Template-based<br/>benchmarks"] --> B["Inflated scores<br/>on simple charts"]
B --> C["CharXiv exposes<br/>real gaps"]
C --> D["Better multimodal<br/>AI systems"]
A2["Reasoning gap<br/>overlooked"] --> B2["Models can describe<br/>but not reason"]
B2 --> C
C --> D2["Targeted research<br/>on chart reasoning"]
style A fill:#e74c3c,color:#fff,stroke:#333
style A2 fill:#e74c3c,color:#fff,stroke:#333
style C fill:#27ae60,color:#fff,stroke:#333
style D fill:#3498db,color:#fff,stroke:#333
style D2 fill:#3498db,color:#fff,stroke:#333
```
- Exposes inflated progress — Reveals that high scores on existing benchmarks don’t translate to real chart understanding
- Separates description from reasoning — Shows that models can extract data but struggle to reason about it
- Uses real-world charts — Scientific charts from arXiv are far more complex and diverse than template-generated ones
- Tests refusal ability — Unanswerable questions reveal whether models confabulate when information is missing
- Expert-curated quality — Every chart and question verified by human experts, ensuring meaningful evaluation
- Covers 97 models — The most comprehensive chart understanding leaderboard available
Conclusion
CharXiv reveals a critical truth about multimodal AI: being able to describe a chart is not the same as understanding it.
- 2,323 real scientific charts from arXiv with ~11,600 expert-curated questions
- Built by Princeton NLP (Zirui Wang, Danqi Chen, Sanjeev Arora, and team), published at NeurIPS 2024
- The best model (o3 high) scores 78.6% on reasoning — approaching but not yet matching human performance of 80.5%
- Most models score well below 60% on reasoning, despite achieving 85%+ on descriptive questions
- Open-source models lag significantly — the best (Qwen2.5-VL 72B at 49.7%) is nearly 30 points behind the best proprietary model on reasoning
- Domain-specific chart models underperform general-purpose MLLMs, suggesting that targeted chart training alone is insufficient
As multimodal AI advances, CharXiv provides a rigorous, realistic benchmark for measuring genuine chart understanding — not just pattern matching on simplified templates. The gap between descriptive and reasoning performance highlights the fundamental challenge ahead: teaching AI to truly reason about visual data.
References
- Wang, Z., Xia, M., He, L., Chen, H., Liu, Y., Zhu, R., Liang, K., Wu, X., Liu, H., Malladi, S., Chevalier, A., Arora, S., Chen, D. “CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs.” arXiv preprint arXiv:2406.18521 (2024). arxiv.org/abs/2406.18521
- Princeton NLP. “CharXiv — Project Page and Leaderboard.” charxiv.github.io (consulted March 28, 2026)
- Princeton NLP. “CharXiv Dataset.” Hugging Face. huggingface.co/datasets/princeton-nlp/CharXiv
- Princeton NLP. “CharXiv GitHub Repository.” github.com/princeton-nlp/CharXiv
Read More
- Compare with the hardest academic benchmark — see Humanity’s Last Exam (HLE)
- Compare with the AGI fluid intelligence benchmark — see ARC-AGI-2
- Deploy models for running your own evaluations — see Deploying and Serving LLM with vLLM
- Scale your evaluation infrastructure — see Scaling LLM Serving for Enterprise Production
- Track model costs when running evaluations — see FinOps Best Practices for LLM Applications
- CharXiv Official Leaderboard
- CharXiv Dataset on Hugging Face
- CharXiv GitHub Repository
- CharXiv arXiv Paper (121 pages)